From 3f6940ab3739589d01af033c47b973decd150b5f Mon Sep 17 00:00:00 2001
From: ewen
Date: Sun, 28 Sep 2025 22:42:32 +0000
Subject: [PATCH] Added a comment: Feed seems to now be parsed as UTF-8 characters, not binary mode

---
 ..._9982bda0b8b224edd2300083f7e1ec00._comment | 31 +++++++++++++++++++
 1 file changed, 31 insertions(+)
 create mode 100644 doc/bugs/importfeed__58___Enum.toEnum__123__Word8__125____58___tag___40__8217__41___is_outs/comment_5_9982bda0b8b224edd2300083f7e1ec00._comment

diff --git a/doc/bugs/importfeed__58___Enum.toEnum__123__Word8__125____58___tag___40__8217__41___is_outs/comment_5_9982bda0b8b224edd2300083f7e1ec00._comment b/doc/bugs/importfeed__58___Enum.toEnum__123__Word8__125____58___tag___40__8217__41___is_outs/comment_5_9982bda0b8b224edd2300083f7e1ec00._comment
new file mode 100644
index 0000000000..56b0b23315
--- /dev/null
+++ b/doc/bugs/importfeed__58___Enum.toEnum__123__Word8__125____58___tag___40__8217__41___is_outs/comment_5_9982bda0b8b224edd2300083f7e1ec00._comment
@@ -0,0 +1,31 @@
+[[!comment format=mdwn
+ username="ewen"
+ avatar="http://cdn.libravatar.org/avatar/605b2981cb52b4af268455dee7a4f64e"
+ subject="Feed seems to now be parsed as UTF-8 characters, not binary mode"
+ date="2025-09-28T22:42:32Z"
+ content="""
+I think the relevant change is likely to be:
+
+```
+* feed (update: parseFeedFromFile uses openBinaryFile, updated git-annex to open
+  the file itself instead)
+```
+
+from [https://git-annex.branchable.com/bugs/35_failed_tests_on_beegfs/#comment-d7e4cf0592937215e3acd3c08c03288c](https://git-annex.branchable.com/bugs/35_failed_tests_on_beegfs/#comment-d7e4cf0592937215e3acd3c08c03288c)
+
+Based on the fact that's a 2025-09-04 change (so since previous release), refers to `parseFeedFromFile`, and the relevant commit seems to be:
+
+[http://source.git-annex.branchable.com/?p=source.git;a=commit;h=2b1e9eced2fe825c882b4e9549a3a12f41d08055](http://source.git-annex.branchable.com/?p=source.git;a=commit;h=2b1e9eced2fe825c882b4e9549a3a12f41d08055)
+
+and in particular in this file:
+
+[http://source.git-annex.branchable.com/?p=source.git;a=blobdiff;f=Command/ImportFeed.hs;h=e36e72370204ece44a05bfae5954272a46f34f5c;hp=7b66a2b5077613b7e33dc8597a8272e7fdea7102;hb=2b1e9eced2fe825c882b4e9549a3a12f41d08055;hpb=56cd59a9f4e24c5a6842179e0da9180875d837cc](http://source.git-annex.branchable.com/?p=source.git;a=blobdiff;f=Command/ImportFeed.hs;h=e36e72370204ece44a05bfae5954272a46f34f5c;hp=7b66a2b5077613b7e33dc8597a8272e7fdea7102;hb=2b1e9eced2fe825c882b4e9549a3a12f41d08055;hpb=56cd59a9f4e24c5a6842179e0da9180875d837cc)
+
+My reading of that code is that the feed parsing switched from (implicitly) \"just bytes\" (`openBinaryFile`) to decoding UTF-8 into full Unicode characters, but there's something in either (a) the later git-annex code or (b) the XML parser that does not expect to receive non-ASCII Unicode characters resulting from opening in \"character\" mode rather than \"binary\" mode, resulting in out-of-range values.
+
+Which results in the crash on encountering the first non-ASCII character in the feed :-/
+
+It's not clear to me why, in fixing \"set close-on-exec bit on open files\", the feed parsing was changed from bytes (binary mode) to decoded characters. But it appears it wasn't tested on feeds where the text has been through a word processor throwing in smart quotes and smart dashes and the like all over the place.
+
+Ewen
+"""]]
-- 
2.30.2